A final report for Grant NAG-1-981 entitled 





STUDY ON FAULT-TOLERANT PROCESSORS FOR 
ADVANCED LAUNCH SYSTEM 


Kang G. Shin and Jyh-Charn Liu

Real-Time Computing Laboratory 
Department of Electrical Engineering and Computer Science 
The University of Michigan 
Ann Arbor, Michigan 48109-2122 

(313) 763-0391; e-mail: kgshin@dip.eecs.umich.edu 


Prepared for 

NASA Langley Research Center 
Mail Stop 130 
Hampton, VA 23665 

Attention: Felix Pitts and Allan White 


November 27, 1990 


(NASA-CR-ig$053O STUDY ON FAULT-TOLERANT
PROCESSORS FOR ADVANCED LAUNCH SYSTEM Final
Report (Michigan Univ.) 35 p CSCL 22

N91-12721
Unclas
G3/15 0317945


ABSTRACT 


Issues related to the reliability of a redundant system with large main memory are 
addressed in this report. The Fault-Tolerant Processor (FTP) for the Advanced Launch 
System (ALS) is used as a basis for our presentation. When the system is free of latent faults, 
the probability of system crash due to multiple channel faults is shown to be insignificant 
even when the outputs of computing channels are infrequently voted on. Using channel error 
maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy 
or the number of channels for applications with long mission times. 

Even without using a voter, most memory errors can be immediately corrected by
CEMs implemented with conventional coding techniques. In addition to enhancing
system reliability, CEMs, which have a very low hardware overhead, can
dramatically reduce not only the need for memory re-alignment, but also the time required
to re-align channel memories in the rare case that such a need arises. Using CEMs, we
have developed two different schemes to solve the memory re-alignment problem. In both
schemes, most errors are corrected by CEMs, and the remaining errors are masked by a
voter.

Index Terms: Access sets, active agent, Fault-Tolerant Processor (FTP), fault register,
latent errors, channel error maskers (CEMs), random access memory, recovery page,
reliability, single channel fault model, voting.

i.e., it takes several steps to vote on a message/data, each step taking approximately 5
μseconds, when compared to the instruction cycles of contemporary microprocessors. This
in turn implies that FTP’s channel outputs should not be voted on frequently. As embedded
systems are becoming increasingly complex, one must carefully investigate the dynamics of
system failure for life-critical applications with long mission times.

The first important problem to be addressed in this report is the probability of system 
failure due to multiple channel faults in FTP. By developing a realistic system model, 
we shall first show this probability to be negligible. Then, we analyze the probability of 
resource exhaustion as a result of failures for applications with long mission times. A serious 
drawback of low-speed (slack) voters is that when channels have large main memory, it is 
very time-consuming to re-align all channels with a slack voter into an identical state. 

To alleviate the difficulty associated with memory re-alignment, a monitoring technique 
using signature analysis was proposed in [3]. In this method, main memory is decomposed 
into signature pages, and memory accesses to each page are encoded into a signature which 
is then stored in an independent signature memory. A fault is thus detectable only when
a faulty word is accessed. Upon detection of a fault in main memory, only those pages
with different signatures need to be re-aligned. Signature analysis cannot completely
overcome the memory re-alignment problem, because even though a massively redundant
system may have congruent inputs, errors caused by random permanent/transient faults
that occur in memory cells cannot be detected by this technique. Thus, in the worst case,
the whole main memory must be re-aligned when such faults occur.

Channel failure rate has the most pronounced effects on system reliability, because it (i) 
determines when a resource exhaustion occurs, and (ii) affects the process of fault detection, 
fault location, and reconfiguration. When channel failure rate is high, the success of a 
mission will be greatly affected by the quality of fault handling processes. On the other
hand, when channel failure rate is low, rare occurrences of faults will lower the demand on
system resources (such as hardware, computing time, and software routines) for fault
handling. System design would thus be greatly simplified if the channel failure rate could be
effectively reduced.

As a solution to both the reliability enhancement and memory re-alignment problems, we
propose to use channel error maskers (CEMs). The main motivation behind this proposal
is to make channels more reliable by masking and/or correcting channel faults with CEMs.
When the reliability of each channel is improved, the need for memory re-alignment can be
reduced significantly. It is shown that the reliability of ALS is dramatically improved when
the CEMs for main memory are implemented with common single error correction/double
error detection (SEC/DED) codes. Furthermore, using CEMs can speed up the memory
re-alignment process substantially, because only those faulty words uncovered by CEMs need
to be re-aligned by the voter. Two different schemes, called Scheme_1 and Scheme_2, are
developed for the re-alignment of each channel’s main memory. In Scheme_1, main memory is
decomposed into recovery pages, and a page is re-aligned only when CEMs cannot recover it.
An optimization technique is developed to determine the optimal page size for Scheme_1.
In Scheme_2, the addresses of faulty memory words are recorded, and only those recorded
faulty words need to be re-aligned.

In order to assess FTP’s capability for the ALS mission, the basic operational principles 
of FTP are first introduced in Section 2. In Section 3, we develop a reliability model which 
is then used to evaluate FTP’s reliability for ALS. CEMs are then applied to solve the 
memory re-alignment problem in Section 4. The report concludes with a few remarks in
Section 5.

2 Review of FTP Architecture


The FTP architecture, and the memory access model for the analysis of multiple channel 
faults, are introduced in this section. An FTP channel may have one or two processors. 
When two processors are built into each channel, one processor is dedicated to computation 
functions and the other to I/O functions. The two processors in a channel communicate 
with each other by writing and reading messages in shared memory. There are interval 
timers and watchdog timers in each channel for task scheduling and time-out interrupts, 
respectively. 

FTP can provide high performance, and its architecture can be in the form of simplex, 
duplex, triplex (TMR), or quadruplex (QMR). In a redundant FTP system, only clocks in 
the different channels of FTP are tightly synchronized, and thus, fault-free channels can 
execute identical instructions in lock-step. Channels communicate with each other via a 
network formed by the communicators in redundant channels. 

A block diagram of the communicator network in a QMR FTP is given in Fig. 1, where
a communicator is composed of a set of registers, a transmitter (single input, N outputs),
an interstage (single input, N outputs), and a receiver (N inputs, single output). There are
four channels, called channels A, B, C, and D, respectively. The set of registers Xy, Xr, Xe,
and Xy in channel Y, Y ∈ {A, B, C, D}, stores the inputs and outputs of channel Y to/from
the communicator network.

Logically, channels exchange data by reading/writing data from/to the set of registers
in the communicator. Data communications between channels are classified as voted
data-exchange and simplex data-exchange. FTP design emphasizes the concept of source
congruency: for all types of data-exchanges, all correctly operating channels will eventually
receive identical copies of data. A voted data-exchange allows channels to compare their
outputs and mask any error whenever possible. A voted data-exchange is accomplished by
writing a value to Xy, and then reading the voted result from Xr. Register Xe in each
channel records any discrepancy between its Xy and Xr values. The actual steps in a voted
data-exchange are that (1) every channel sends a message to its transmitter, which will then
relay the message to its own interstage, and (2) through the fully-connected network from
N interstages to N receivers, the receiver in every channel gets a voted message and stores
it in Xr.

A simplex data-exchange can be used by a channel to broadcast messages to the other
channels. For example, if a message needs to be transferred from channel A to the others, all
channels execute an instruction “write message $ to Xa”. When the instruction “write $
to Xa” is executed by channel A, $ will be broadcast via the transmitters to all interstages.
In the meantime, the pseudo-messages $ sent by channels B, C and D are discarded by the
communicator network. After all interstages receive replicated copies of $, $ is broadcast
on the interstage network, and every receiver will have the voted $ stored in its Xr. Through
such an exchange process, data congruency is guaranteed for both voted and simplex
data-exchanges.
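The logical effect of the two exchange types can be illustrated with a minimal Python sketch. It is a simplified model and not the FTP hardware protocol: the two-stage transmitter/interstage/receiver relay is collapsed into a single function call, and the function names `voted_exchange` and `simplex_exchange` are hypothetical.

```python
from collections import Counter


def voted_exchange(written):
    """Model of a voted data-exchange.

    written: dict mapping channel name -> value the channel wrote for voting.
    Returns (voted value, dict channel -> True if that channel disagreed),
    i.e. the roles played by the result register and the error register.
    """
    counts = Counter(written.values())
    voted, votes = counts.most_common(1)[0]
    if 2 * votes <= len(written):
        raise ValueError("no majority among the channel values")
    errors = {ch: value != voted for ch, value in written.items()}
    return voted, errors


def simplex_exchange(written, source):
    """Model of a simplex data-exchange: only the source channel's value is
    broadcast; values written by the other channels are discarded."""
    broadcast = written[source]
    return {ch: broadcast for ch in written}


# Channel C disagrees; the vote masks its value and flags the discrepancy.
voted, errors = voted_exchange({"A": 5, "B": 5, "C": 7, "D": 5})
# Channel A broadcasts; every channel ends up with the same copy.
copies = simplex_exchange({"A": 1, "B": 2, "C": 3, "D": 4}, source="A")
```

In this model every fault-free channel ends an exchange holding the same value, which is the source-congruency property described above.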

3 Reliability Analysis 

To justify the use of a single fault model, the probability of multiple channel faults will 
first be shown to be negligibly small. Then, using this single fault model, the reliability 
impact of CEMs on the ALS will be analyzed. 

3.1 Probability of Multiple Channel Faults 

A complete reliability model of FTP must include both single channel and multiple
channel faults. However, incorporation of multiple channel faults into the reliability model
will make the analysis very complex, because it deals with a multivariate distribution. To
alleviate the complexity, calculation of the probability of multiple channel faults is separated
from that of the probability of system crash caused by resource depletion.

Since a channel’s operation depends heavily on its access to main memory, a memory
access model needs to be developed for the evaluation of multiple channel faults. The
existence of memory access locality has long been the key to the modeling of memory
access behavior. That is, once a program starts to access a specific memory area, it tends
to access the area continuously for a certain period. Thus, although memory cells are
physically identical, different parts of main memory must be distinguished when they have
different logical uses.

Using memory access locality, a program’s memory access behavior can be modeled by
an active agent visiting main memory and forming access sets [4]. An access set is defined
as a memory area that is continuously accessed for a certain period of time during each visit.
We further assume that (1) the location of each access set in the system does not change during
the time of interest, (2) the number of access sets in the system is fixed, and (3) all access
sets have the same size u and are disjoint from one another. Based on these assumptions,
an access set can be used to denote a set of “physical” memory cells in a certain area.

Let $V_j^i$ denote the event of the agent’s j-th visit to $AS^i$, and let $M_j^i \in \Re^+ \cup \{0\}$
be the lifetime of $V_j^i$, where $\{AS^1, AS^2, \ldots, AS^m\}$ are the m access sets in main memory.
The $M_j^i$’s are assumed to be independent and identically distributed (i.i.d.) random variables,
and the agent’s present and future visits are independent of its past visits. Based on this
memory access model, a multiple channel fault occurs when more than one agent (i.e., channel
processor) either becomes faulty or visits a faulty access set during one inter-voting period.

To calculate the probability of multiple channel faults, it is assumed that memory is 
initially free of faults and is not re-aligned when the voter can mask faulty channel outputs. 
Voters are assumed to be the only fault detection/masking mechanism in the system. Once 
the processor enters an access set, any of the faults in the processor, voter, or the access
set will cause a faulty output on the channel. Note that faults in one access set do not
propagate to another access set. For convenience of presentation, all processor and voter
faults are classified as access set faults. Since the interval between two successive component 
failures is very large as compared to memory access times, a large number of access sets 
will be visited between two successive fault occurrences, and no new fault is likely to occur 
in an access set while it is being visited. Faults occur independently according to a Poisson 
process. 

A system with m access sets is said to be in state i when the agent is visiting $AS^i$. Let
$S_k^i$ be the time $V_k^i$ begins (the agent’s k-th visit to access set $AS^i$), and let
$N^i(t) = \sup\{k \mid S_k^i < t\}$, $N^i(t) \in I^+ \cup \{0\}$, $\forall i, t$. For a given time interval
$[0, t)$, $t > 0$, the total number of visits made by the agent to access sets is

$$N(t) = \sum_{k=1}^{m} N^k(t).$$

One or more access sets may have been visited before the channels vote on their outputs.
Let the random variable $X_i^j \in I^+ \cup \{0\}$ denote the number of faults that occurred in
$AS^j$ during the agent’s i-th visit (to some access set). Since faults occur uniformly within
memory and the $X_i^j$’s are assumed to be i.i.d., the $X_i^j$’s will be represented by a single
random variable X. Thus, at any time instant, the agent’s decision on which access set to
visit makes no difference. Let $Y_i = T_i - T_{i-1}$, where $T_j$ is the time of the j-th voting on
channel outputs. When the $Y_i$’s are i.i.d., they can be represented by a single random
variable Y.

Assuming that the agent has visited l access sets during $[0, T_{j-1})$, at time $T_{j-1}$ there are
$lX$ faults in $AS^i$. In an NMR system, let $P_c(Y_i)$ be the probability of a channel generating
a faulty output during the time interval $[T_{i-1}, T_i)$. During $[0, T_j)$, the total probability of
system crash due to multiple channel faults becomes

$$P_s(T_j) = \sum_{k=1}^{j} \sum_{i=2}^{N} \binom{N}{i} P(\text{no crash before } T_{k-1}) \, (P_c(Y_k))^i (1 - P_c(Y_k))^{N-i} \qquad (3.1)$$

$$\le \sum_{k=1}^{j} \sum_{i=2}^{N} \binom{N}{i} (P_c(Y_k))^i (1 - P_c(Y_k))^{N-i}. \qquad (3.2)$$

To evaluate $P_c(Y_i)$, within $[T_{i-1}, T_i)$ the probability of a faulty output generated by a
channel is

$$P_c(Y_i \mid w \text{ visits between two successive votings}) = 1 - (P(X = 0))^w = 1 - e^{-wu\lambda Y} \approx wu\lambda Y, \qquad (3.3)$$

where u and $\lambda$ are the size of an access set and the failure rate of a memory word, respectively.
When w = 1, i.e., channels vote on their results after accessing each access set, Eq. (3.2)
can be simplified as

$$P_s(T_j) \le j \sum_{i=2}^{N} \binom{N}{i} (u\lambda Y)^i (1 - u\lambda Y)^{N-i}. \qquad (3.4)$$
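As a rough numerical companion to Eqs. (3.2) to (3.4), the following Python sketch evaluates the bound under one reading of those equations: the crash probability per inter-voting period is taken to be the binomial tail over two or more faulty channels, summed over equal-length periods. That reading, the function names, and the sample parameter values are assumptions made for illustration only.

```python
from math import comb, expm1


def channel_fault_prob(w, u, lam, Y):
    """P_c for one inter-voting period: 1 - exp(-w*u*lam*Y).

    w: access sets visited between votes, u: words per access set,
    lam: failure rate of one memory word, Y: inter-voting period
    (lam and Y must use the same time unit).
    """
    return -expm1(-w * u * lam * Y)


def crash_bound(N, p_c, periods):
    """Upper bound on system crash probability over `periods` voting periods,
    assuming a crash needs two or more of the N channels to be faulty
    within the same period."""
    tail = sum(comb(N, i) * p_c**i * (1.0 - p_c) ** (N - i)
               for i in range(2, N + 1))
    return periods * tail


# Hypothetical numbers: 1000-word access sets, 1e-9 faults/hour per word,
# one access set per 40 ms voting period, QMR, one week on the pad.
Y = 0.040 / 3600.0                      # 40 ms expressed in hours
p_c = channel_fault_prob(w=1, u=1000, lam=1e-9, Y=Y)
bound = crash_bound(N=4, p_c=p_c, periods=round(168.0 / Y))
```

Because $P_c$ is tiny for any realistic voting period, the tail is dominated by the two-channel term, which is why the bound stays small even when many periods are summed.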


3.2 Analysis of ALS 

In this subsection, the probability of system crash due to multiple channel faults and 
the effectiveness of CEMs are discussed using the ALS mission scenario. The ALS will first 
sit on the launching pad for a week, and will then be in the boost phase for 10 minutes. Any 
approval for launch requires the system to have fault masking capability. The system must
have 0.95 probability of availability, 0.98 probability of mission success, and less than $10^{-5}$
system unreliability at the end of mission. Since information on the maintenance schedule
and the requirement for mission success are not available, we will focus on system reliability 
and the probability of system possessing fault masking capability before launch. 

The parameters necessary to estimate the reliability of ALS are derived from the results
of the Entry Research Vehicle (ERV) study. Permanent failure rates of the processors
(including control circuits) and the interstage are predicted to be $\lambda_p = 8.91 \times 10^{-6}$/hour
and $\lambda_i = 1 \times 10^{-6}$/hour, respectively [5]. Permanent failure rates of 64K x 4 RAM chips and
128K x 8 ROM chips are predicted to be $6 \times 10^{-6}$/hour and $2.8 \times 10^{-6}$/hour, respectively.
A redundant FTP equipped with 1M bytes of ROM and RAM in each channel is considered
as an example ALS controller. Thus, the main memory needs 32 (8) RAM (ROM) chips,
and the total failure rate of RAM (ROM) is $\lambda_m = 192 \times 10^{-6}$/hour ($\lambda_o = 20.8 \times 10^{-6}$/hour).
Note that the above failure rates have been adjusted by the environment and quality factors,
$\Pi_e$ and $\Pi_q$, i.e., $\lambda_x = \Pi\lambda_r$, where $x \in \{p, i, o, m\}$ and $\Pi = \Pi_e\Pi_q$. Since $\Pi_q = 0.5$ and
$\Pi_e = 3$ in the ERV study, the actual component failure rates are $\lambda_p = 5.94 \times 10^{-6}$/hour,
$\lambda_o = 13.86 \times 10^{-6}$/hour, $\lambda_i = 1 \times 10^{-6}$/hour, and $\lambda_m = 128 \times 10^{-6}$/hour, respectively. With
these parameter values, one can see that in the ALS, 96% of the channel faults are caused
by main memory faults. This can be broken down to 86.8% of the faults due to RAM and
9.2% due to ROM.
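The chip counts and the removal of the combined adjustment factor can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes 1M bytes means $2^{20}$ bytes and 64K (128K) means $2^{16}$ ($2^{17}$) words, and the helper name `chips_needed` is hypothetical.

```python
def chips_needed(total_bytes, words_per_chip, bits_per_word):
    """Number of memory chips needed to hold total_bytes of storage."""
    return (total_bytes * 8) // (words_per_chip * bits_per_word)


MEGABYTE = 2 ** 20                                   # assumed meaning of "1M bytes"
ram_chips = chips_needed(MEGABYTE, 64 * 1024, 4)     # 64K x 4 RAM chips
rom_chips = chips_needed(MEGABYTE, 128 * 1024, 8)    # 128K x 8 ROM chips
ram_total_rate = ram_chips * 6e-6                    # adjusted RAM rate, per hour

# Combined adjustment factor: environment factor 3 times quality factor 0.5.
PI = 3 * 0.5
adjusted = {"processor": 8.91e-6, "RAM": 192e-6, "ROM": 20.8e-6}
unadjusted = {name: rate / PI for name, rate in adjusted.items()}
```

Dividing the adjusted rates by the factor of 1.5 gives values close to the 5.94, 128, and 13.86 (all times $10^{-6}$/hour) quoted in the text.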


The system cycle of FTP is 40 msec, within which all the essential control functions,
including fault recovery processes, must be completed for the system to function acceptably.
It takes about 11 μseconds to vote on one memory word: the processor reads a memory
word, votes on it, reads the result back from the voter, and then writes the voted word back
to the memory. Because of the relatively low system failure rate and the frequent memory
scrubbing, it is reasonable to assume that the system is free of latent faults.


Note that when a fault occurs in the access set that is currently being visited by the
agent, the fault cannot be detected/corrected by the scrubbing process, because the
scrubbing process is given the lowest priority. In the FTP, computing channels vote on their
outputs at least once every 40 mseconds, i.e., $T_i - T_{i-1} \le 40$ mseconds. Thus, given that
no fault occurs before $T_{i-1}$, and $u\lambda = \lambda_a$, where $\lambda_a$ is the failure rate of an access
set (including processor, interstage, memory and the access set itself), the probability of
system crash due to multiple channel faults in an NMR system during $Y_i = [T_{i-1}, T_i)$ is


$$P_c(Y_i) \approx \sum_{j=2}^{N} \binom{N}{j} e^{-(N-j)\lambda_a T_i} \left(e^{-\lambda_a T_{i-1}} \lambda_a Y_i\right)^j. \qquad (3.5)$$

For example, within $[0, t)$, the total probability of system crash due to two channel errors
is

$$P_s(t) \le \frac{t}{Y} \binom{4}{2} (Y\lambda_a)^2 + O(h) \qquad (3.6)$$

$$\le 6tY\lambda_a^2. \qquad (3.7)$$

In Eq. (3.7), the probability that the system does not crash before t is not considered,
and $O(h)$ is the probability of 3 or more channels becoming faulty simultaneously. The
probability of system crash due to multiple channel faults in the FTP is plotted in Fig. 2.
This probability is shown to be very small even when the size of access set is very large.
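To get a feel for the magnitude involved, the two-channel bound can be evaluated numerically. The sketch below reads the bound as $(t/Y)\binom{4}{2}(Y\lambda_a)^2 = 6tY\lambda_a^2$ for a QMR system and, as a deliberately pessimistic assumption, uses the whole-channel failure rate (the sum of the four component rates quoted earlier, $148.8 \times 10^{-6}$/hour) in place of the access-set rate $\lambda_a$. The function name is hypothetical.

```python
from math import comb


def two_fault_bound(t, Y, lam_a, channels=4):
    """(t / Y) * C(channels, 2) * (Y * lam_a)**2.

    t: mission time, Y: inter-voting period, lam_a: access-set failure rate
    (all in consistent time units).
    """
    periods = t / Y
    return periods * comb(channels, 2) * (Y * lam_a) ** 2


Y = 0.040 / 3600.0          # 40 ms voting period in hours
lam_a = 148.8e-6            # assumed: whole-channel rate used as lam_a
bound = two_fault_bound(t=168.0, Y=Y, lam_a=lam_a)   # one week on the pad
```

Even with this pessimistic rate, the bound for a one-week wait comes out on the order of $10^{-10}$, consistent with treating multiple channel faults as negligible.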

After the probability of multiple channel faults is evaluated, a continuous-time Markov
model can be developed for the reliability analysis of a QMR system due to resource
exhaustion. As shown in Fig. 3, states A, B, C, D and E are used to denote the conditions where
the system has four, three, two, one fault-free channels, and system crash, respectively. The
model can be modified for a TMR system with state A removed. In this model, $\lambda_t$ ($\lambda_h$)
is the failure rate of transient (hard) faults, c is the recovery coverage of transient faults,
and $c_j$ is the reconfiguration coverage of a duplex configuration. Assuming that a channel
will be retired if any of its components becomes faulty, the total failure rate of a channel is
$\lambda_c = \lambda_p + \lambda_m + \lambda_o + \lambda_i$. (See Appendix A for definitions of the $\lambda$’s.)

A similar, but more complicated, FTP reliability model has been developed by CSDL
[5]. In CSDL’s model, every component failure is considered to be an independent event,
and the system reconfiguration time is treated as a random variable with an exponential
distribution. Our model differs from CSDL’s in that (1) system states are defined
by the number of fault-free channels, (2) different component failures in one channel are
aggregated into one single event, because when component failures are memoryless and
reconfiguration rates for different component failures are the same, the channel failure rate
is the sum of component failure rates, and (3) system reconfiguration is considered to
be done instantaneously, because it is usually done in one system cycle, 40 mseconds, or
90,000/hour, which is extremely fast relative to faults’ inter-arrival times.

Next, we want to evaluate the effectiveness of CEMs. A channel with embedded CEMs
will be retired if CEMs become faulty. Thus, the channel failure rate becomes
$\lambda'_c = \lambda_p^e + \lambda_i^e + \lambda_o^e + \lambda_m^e + (1 - c_p)\lambda_p + (1 - c_i)\lambda_i + (1 - c_o)\lambda_o + (1 - c_m)\lambda_m$,
where $c_p, c_i, c_o$, and $c_m$ ($\lambda_p^e, \lambda_i^e, \lambda_o^e, \lambda_m^e$) are the coverage (failure rate) of CEMs for
processors, interstage, ROM and RAM, respectively.
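The channel failure rate with CEMs is a simple sum: each component contributes the uncovered part of its own failure rate plus the failure rate of its masker. The Python sketch below evaluates it with component rates quoted in this report; the assumptions (labeled in the code) are that only memory has a CEM, that a single CEM with the interstage's failure rate serves both ROM and RAM, and that memory coverage is $1 - 2 \times 10^{-7}$. The function name is hypothetical.

```python
def channel_rate_with_cems(rates, coverage, cem_rates):
    """Sum of CEM failure rates plus the uncovered share of each component rate."""
    uncovered = sum((1.0 - coverage[name]) * rate for name, rate in rates.items())
    return sum(cem_rates.values()) + uncovered


# Component failure rates per hour, as quoted in the text.
rates = {"processor": 5.94e-6, "interstage": 1e-6, "ROM": 13.86e-6, "RAM": 128e-6}

# Assumptions for illustration: no CEM on processor or interstage, one shared
# memory CEM whose failure rate equals the interstage rate.
coverage = {"processor": 0.0, "interstage": 0.0, "ROM": 1 - 2e-7, "RAM": 1 - 2e-7}
cem_rates = {"memory_cem": 1e-6}

without_cems = sum(rates.values())
with_cems = channel_rate_with_cems(rates, coverage, cem_rates)
```

Under these assumptions the channel failure rate drops from roughly $149 \times 10^{-6}$/hour to roughly $8 \times 10^{-6}$/hour, because the residual rate is dominated by the processor, the interstage, and the CEM itself.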


As mentioned earlier, 96% of channel failures are due to main memory failures. Thus,
adding CEMs to main memory can dramatically reduce the channel failure rate. On the 
other hand, CEMs could be designed for processors (and control circuits), but this is more 
difficult and has little impact on system reliability, since only 4% of channel failures result 
from this portion of hardware. Consequently, the design of CEMs for processors will not 
be considered any further. Note that CEMs would be inefficient if they could not achieve 
high fault coverage during the mission. Assuming that CEMs for memory can correct w 
bit-errors in an n-bit word, and faults in memory bits are independent of each other, one 
can derive the coverage of CEMs at time t as: 


Cm(^o) 


-■“=*• A* 


g- Q(i - e 


S-A t 


(3.8) 


where $\lambda$ is the failure rate of a memory word. For the FTP example, if we use 7 extra
bits to encode a 32-bit data word by SEC/DED codes, we get $\lambda$ = x $10^{-9}$/hour-word,
and $c_m \approx c_o \approx 1 - 2 \times 10^{-7}$ at t = 200 hours. However, when multiple-bit chips are
used, other coding schemes should be employed, such as those in [6], to provide high fault
coverage. Since the implementation of CEMs for main memory is straightforward with
standard commercial error controllers (e.g., 74ALS632B), they will not be discussed any
further in this report.

Evaluation of the reliability of a redundant system with CEMs is very simple when
the system has perfect reconfiguration capability. For example, consider two redundant
systems with N and W computing channels, respectively. CEMs are embedded into the
NMR system, denoted as NMR-CEM, but no CEM is embedded into the WMR system. Let
the channel failure rate of NMR-CEM (WMR) be $\lambda'_c$ ($\lambda_c$), and N < W; then the probability
of NMR-CEM (WMR) crash before time t is $P_N(t) = (1 - e^{-\lambda'_c t})^N$ ($P_W(t) = (1 - e^{-\lambda_c t})^W$).
When $\lambda_c t \ll 1$ ($\lambda'_c t \ll 1$), $P_N(t) \approx (\lambda'_c t)^N$ ($P_W(t) \approx (\lambda_c t)^W$). Thus, an NMR-CEM is more
reliable than a WMR system when $(\lambda'_c t)^N < (\lambda_c t)^W$. Note, however, that a numerical
method is usually called for when systems do not have perfect reconfiguration capability.
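The comparison can be tried numerically with a short Python sketch. It assumes perfect reconfiguration, so a system crashes only when all of its channels have failed, and it uses illustrative channel failure rates: $148.8 \times 10^{-6}$/hour without CEMs (the sum of the component rates quoted earlier) and a hypothetical $7.94 \times 10^{-6}$/hour with CEMs on memory. The function name is hypothetical.

```python
from math import expm1


def crash_prob(channel_rate, t, channels):
    """(1 - exp(-rate * t)) ** channels: all channels fail before time t."""
    return (-expm1(-channel_rate * t)) ** channels


T_PAD = 168.0                                   # one week on the pad, in hours
qmr = crash_prob(148.8e-6, T_PAD, channels=4)   # conventional QMR
tmr_cem = crash_prob(7.94e-6, T_PAD, channels=3)  # TMR with CEMs (assumed rate)
```

With these numbers the three-channel system with CEMs comes out more reliable than the four-channel system without them, which is the qualitative point of the inequality above.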

Using the component failure rates predicted by the ERV study, numerical solutions of
the ALS reliability with and without CEMs are calculated with METASAN [7]. Let the
failure rate of CEMs for memory be the same as that of an interstage, the coverage of
transient faults be 1.0, and the coverage of duplex system be 0.9, i.e., $c_p = c_i = 0$, $\lambda_m^e = \lambda_i$,
c = 1.0, and $c_j = 0.9$. The probability of system crash while sitting on the launch pad for
TMR and QMR systems with and without CEMs is plotted in Figs. 4 and 5. The two
diagrams in Fig. 4 (5) show the reliability impacts of CEMs when $\Pi = 0.1$ and $\Pi = 1$,
respectively, where $\Pi$ is an adjusting factor of channel failure rate. In Fig. 4, SEC/DED
codes are embedded into RAM only, and in Fig. 5, SEC/DED codes are embedded into
both ROM and RAM. Clearly, a TMR system with the entire memory (ROM and RAM)
encoded is more reliable than a conventional QMR system even for very short missions
and very low component failure rates. Furthermore, while the reliability improvement by
changing from a conventional TMR to a QMR system is in the order of 10 to 100, when
CEMs are embedded into main memory, the reliability improvement by upgrading a system
from TMR-CEM to QMR-CEM is in the order of $10^3$ to 10*.

1 METASAN is a registered trademark of the Industrial Technology Institute.

The probability of FTP retaining fault masking capability for the ALS is examined
next. As shown in Fig. 6, the probability that a conventional QMR system retains fault
masking capability decreases quickly with increases in $\Pi$ (i.e., component failure rates) and
launch waiting times. On the other hand, since channels in TMR-CEM or QMR-CEM are
inherently reliable, the probability of launch approval increases substantially even for very
long waiting times.

Finally, the total system reliability throughout the mission can be derived as follows. 
The system unreliability is the sum of the probability of system crash before launch, and the 
probability of system crash during the launch. Since the system cannot be launched unless 
the FTP retains fault masking capability, we can calculate the probability of system crash
during the boost period conditioned on the FTP having fault masking capability. When
the boost time is less than 20 minutes, the probability of system crash during the launch is
lower than $10^{-7}$ for systems without CEMs, and the figures are much lower for systems with
CEMs. Thus, the probability of system crash during the on-pad waiting period is much
higher than the probability of system crash during the boost phase. 

4 Memory Re-alignment 

Application of CEMs to the memory re-alignment problem is the subject of this section.
In a conventional QMR system, the probability that the channels need to be re-aligned is
$P_r(t) = 1 - e^{-4\lambda_t t}$, where $\lambda_t$ is the transient failure rate of RAM. When $\lambda_t = 128 \times 10^{-5}$/hour, we
get $P_r(200) \approx 0.64$, implying that memory faults pose a serious threat to any system
design.

Theoretically, when a transient fault occurs in memory, the fault can be corrected by
memory re-alignment. However, since it is very time-consuming to re-align channel
memories, and since it is difficult to discriminate permanent, intermittent, and transient faults
in a limited amount of time, it is highly desirable to correct faults, if possible, by CEMs
without using memory re-alignment. For example, when SEC/DED codes are embedded
into main memory, the transient failure rate is reduced by a factor of $2 \times 10^{-7}$. Plugging the
new failure rate into $P_r(t)$, we get $P_r(200) \approx 2 \times 10^{-7}$. Thus, for the ALS mission scenario,
channels’ main memory re-alignment is unlikely to be called for when CEMs are embedded
into main memory.
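The two values of $P_r(200)$ can be reproduced with a few lines of Python, taking $P_r(t) = 1 - e^{-4\lambda_t t}$ for a four-channel system and treating the effect of SEC/DED coding as a multiplication of the transient rate by $2 \times 10^{-7}$. The function name is hypothetical.

```python
from math import expm1


def realign_prob(transient_rate, t, channels=4):
    """Probability that at least one channel needs re-alignment by time t."""
    return -expm1(-channels * transient_rate * t)


LAMBDA_T = 128e-5                                   # transient RAM rate, per hour
p_plain = realign_prob(LAMBDA_T, 200.0)             # no CEMs
p_cem = realign_prob(LAMBDA_T * 2e-7, 200.0)        # rate scaled by the CEM factor
```

The first value is about 0.64 and the second about $2 \times 10^{-7}$, matching the figures in the text.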

In addition to dramatically reducing the need for memory re-alignment, the fault-masking
capability of CEMs can be used to speed up the process of memory re-alignment
substantially. Two schemes, called Scheme_1 and Scheme_2, are developed for the re-alignment
of main memory. In Scheme_1, the entire memory space of W words is decomposed into
K recovery pages, $\Omega_1, \Omega_2, \ldots, \Omega_K$, where $|\Omega_i| = W/K$, $1 \le i \le K$. When the system decides to
start memory re-alignment, all channels scan through main memory page-by-page. After
each page of different channels is scanned, channels have the scanned page re-aligned if
any one of them is found to be faulty. The procedure is repeated until the entire memory
system is completely scanned and/or re-aligned. When two pages have different sizes, we
can repeatedly subtract 1 byte from the page of larger size, and add 1 byte to the other
until the difference of their page sizes is less than, or equal to, 1. Thus, when $W/K$ is not an
integer, there is at most one byte difference among pages. Since the reliability difference
and re-alignment overhead caused by the one-byte difference in page size is negligible, it is
assumed that K can always divide W without leaving a non-zero remainder.

In Scheme_2, the entire memory is decomposed into $\Omega_1$ and $\Omega_2$, where $\Omega_1$ is a fault
register of variable size, and $\Omega_2$ is the rest of main memory. When main memory needs to
be re-aligned, the CPU in each channel scans through its main memory and places addresses
of faulty words into its fault register. After all channels complete their memory scan, they
use simplex data-exchanges to broadcast addresses of faulty words, and then vote on each
faulty word using the voted data-exchange.

Details of Scheme_1 and Scheme_2 are described in pseudo code as follows.


Scheme_1 (channel_i)
begin
    Synchronize channels to start the re-alignment;
    n = 1;
    while (n <= K)    /* scan recovery pages, where K is the number of pages */
    do
        A = "fault-free";    /* the current page is assumed to be fault-free */
        scan Ω_n;
        if (Ω_n faulty & cannot be corrected by CEMs) A = "faulty";
        write A to Xy;
        if (Xr = "faulty" or Xe ≠ 0)    /* at least one channel has a faulty page */
        do    /* re-align Ω_n */
            j = 1;
            while (j <= |Ω_n|)
            do
                write Ω_n(j) to Xy;    /* vote on the j-th word of the page */
                write Xr to Ω_n(j);    /* write the voted result back */
                j = j + 1;
            end_do
        end_do
        n = n + 1;    /* move on to the next recovery page */
    end_do
end

Scheme_2 (channel_i)
begin
    Synchronize channels to start the re-alignment;
    j = 1;
    k = 1;
    while (j <= W)    /* scan main memory, where W is the total memory size */
    do
        read M(j);    /* read the j-th word */
        if (M(j) faulty & cannot be corrected by CEMs)
        do
            write j to Ω_1(k);    /* find a faulty word, and record its address in the fault register */
            k = k + 1;
        end_do
        j = j + 1;
    end_do
    write "EOF" to Ω_1(k);    /* channel_i finishes scanning */
    write "Ready" to Xy;
    while (Xe ≠ zero) write "Ready" to Xy;    /* wait until all channels finish scanning */
    for n = A to D    /* re-align faulty words one by one, starting from channel A to D */
    do
        k = 1;    /* pointer of channel n */
        done = false;
        while (not done)
        do
            write Ω_1(k) to X_n;    /* only channel_n can make a simplex data-exchange;
                                       other channels' write commands will be ignored by the system */
            read Xr;    /* every channel reads the address of the faulty word in channel_n */
            if (Xr ≠ "EOF")
            do
                T = Xr;    /* Xr contains the address of the faulty word */
                write M(T) to Xy;    /* channels vote on the faulty word */
                write Xr to M(T);    /* channels write the voted result back */
            end_do
            else done = true;    /* channel_n has no more faulty words */
            k = k + 1;
        end_do
    end_do
end
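The page-by-page procedure of Scheme_1 can also be expressed as a small, runnable Python simulation. It is a sketch under simplifying assumptions: channel memories are plain lists, the set of words each channel's CEM could not correct is given as input rather than detected, the vote is a simple majority, and K divides W. All names are hypothetical.

```python
from collections import Counter


def vote(values):
    """Majority vote over one value per channel."""
    return Counter(values).most_common(1)[0][0]


def scheme_1(channels, uncorrectable, K):
    """Re-align channel memories page by page.

    channels: list of word lists, one per channel, all of length W.
    uncorrectable: list of sets; uncorrectable[c] holds the addresses that
        channel c's CEM flagged as faulty but could not correct.
    K: number of recovery pages (assumed to divide W).
    Returns the number of pages that were re-aligned.
    """
    W = len(channels[0])
    page_size = W // K
    realigned = 0
    for n in range(K):
        lo, hi = n * page_size, (n + 1) * page_size
        # Each channel reports whether its copy of page n is faulty.
        flags = [any(lo <= addr < hi for addr in bad) for bad in uncorrectable]
        if any(flags):                      # at least one channel has a faulty page
            for j in range(lo, hi):         # vote on every word of the page
                voted = vote([memory[j] for memory in channels])
                for memory in channels:
                    memory[j] = voted
            realigned += 1
    return realigned


# Channel 1 holds a corrupted word at address 2 that its CEM could not fix.
memories = [[1, 2, 3, 4], [1, 2, 9, 4], [1, 2, 3, 4]]
pages_fixed = scheme_1(memories, [set(), {2}, set()], K=2)
```

Only the second page is voted on here, so the cost is one page of votes plus the K status votes, which is the trade-off that makes the choice of K matter.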


Scheme_1 is more robust than Scheme_2, because in Scheme_1 all channels are
executing identical instructions in lock step, and any mismatch between channels can be
easily detected. Thus, fault-free channels can always complete memory re-alignment without
being affected by faulty channels. On the other hand, Scheme_2 is faster but more prone
to errors, because the completion of memory re-alignment can be guaranteed only when
faulty channels can correctly interact with fault-free channels. For example, if the CPU
program counter in one channel stops at a certain point, all the other channels running
Scheme_2 will be stuck in waiting loops. Although this problem can be easily fixed by
adding a time-out to each waiting loop, Scheme_2 needs a substantial modification to
make it robust.

Both Scheme_1 and Scheme_2 induce a fixed time overhead, $Wt_m$, to scan main
memory, where $t_m$ is the memory cycle time. (Since it does not affect the optimization
problem to be discussed, this fixed overhead will not be mentioned in the rest of this section.)
The performance overhead of Scheme_2 is linearly proportional to the total number of
faults, whereas Scheme_1 may be substantially slower than Scheme_2; i.e., Scheme_1 is
faster than Scheme_2 only when $g > K + m\frac{W}{K}$, where g is the total number of faults, and
m is the number of re-aligned recovery pages, because the value of K in Scheme_1 can be
greater than the value of g in Scheme_2.

The speed of Scheme_1 is primarily determined by the size of the recovery page and system
reliability. Denote the number of recovery pages to be re-aligned by a random variable $F$,
$0 \le F \le K$. Then, the memory re-alignment time is

$$t_{ra} = \left(K + F\,\frac{W}{K}\right) t_v,$$

where $t_v$ is the time to take a vote, i.e., the total time to write $X_V$ and read $X_R$ and
$X_E$. From Eq. (3.8) it is not difficult to see that the perfect fault detection assumption is
reasonable even when the channel failure rate is high and CEMs have only fault detection
capability, e.g., even/odd parity codes. When CEMs have only perfect detection capability,
the probability of $f$ faulty recovery pages having occurred in the system by time $t$ is

$$P_K(F = f) = \binom{K}{f} R_p^{K-f}(t)\,\bigl(1 - R_p(t)\bigr)^{f},$$

where $1 - R_p(t)$ is the probability that one or more of the recovery pages which are in
different channels but have the same page number are faulty. Let $\lambda$ and $q$ denote the
failure rate of a memory word and the number of redundant channels, respectively; then
$R_p(t) = e^{-\lambda q W t / K}$. Let $\psi = W q \lambda t$, and let $K$ have been determined; then the
conditional probability of $f$ recovery pages needing to be re-aligned is

$$P_K(t_{ra}) = P\!\left(t_{ra} = K + f\,\frac{W}{K} \;\Big|\; \text{memory is faulty}\right)
= \frac{\binom{K}{f}\, e^{-\psi(1 - f/K)}\,\bigl(1 - e^{-\psi/K}\bigr)^{f}}{1 - e^{-\psi}}. \qquad (4.1)$$
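
The distribution in Eq. (4.1) can be evaluated numerically. The sketch below assumes the binomial form of $P_K(F = f)$ with $R_p(t) = e^{-\psi/K}$ and a normalizing factor $1 - e^{-\psi}$ for the condition "memory is faulty"; the function names are illustrative, and log-gamma is used only to avoid overflow in the binomial coefficient for large $K$.

```python
import math

def page_fault_pmf(f, K, psi):
    """P_K(F = f): probability that exactly f of K recovery pages are faulty."""
    log_r = -psi / K                             # log R_p, with R_p = exp(-psi/K)
    log_p = math.log(-math.expm1(-psi / K))      # log(1 - R_p)
    log_comb = (math.lgamma(K + 1) - math.lgamma(f + 1)
                - math.lgamma(K - f + 1))        # log of binomial coefficient C(K, f)
    return math.exp(log_comb + (K - f) * log_r + f * log_p)

def conditional_pmf(f, K, psi):
    """Eq. (4.1): P_K(F = f) given that the memory is faulty (so f >= 1)."""
    if f < 1:
        raise ValueError("conditioning on a faulty memory requires f >= 1")
    return page_fault_pmf(f, K, psi) / -math.expm1(-psi)

# Parameters of the report's example: lambda * t * q * W
psi = 0.75e-9 * 150 * 4 * 4e6
```

With these values `psi` evaluates to 1.8, the figure used in the examples of this section, and both distributions sum to one over their supports.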

The objective of recovery page design is to minimize the re-alignment overhead so that
at time $t$, the probability of re-alignment requiring more than a time period $T$ is less than
$\epsilon$. Therefore, a solution $K$ is feasible when $P_K(t_{ra} > T) < \epsilon$. The optimization problem is
essentially a non-linear integer programming problem, and can be stated formally as follows:

$$\min\; Z(t) = K + F\,\frac{W}{K}$$

subject to

$$K \in I^{+}, \quad K \le W,$$

$$P_K\!\left(t_{ra} = K + f\,\frac{W}{K}\right) = \frac{\binom{K}{f}\, e^{-\psi(1 - f/K)}\,\bigl(1 - e^{-\psi/K}\bigr)^{f}}{1 - e^{-\psi}},$$

$$P_K(t_{ra} > T) < \epsilon.$$

When $T \ge W t_v$, the recovery page design is trivial, because the memory can be easily
re-aligned by voting on every word. When $T < W t_v$, and the recovery coverage of CEMs
is $c$, no solution can be found if the optimal page size based on the given $c$ is not feasible.
Since the coverage of CEMs is very high, the design problem can be focused on page size
optimization, while the feasibility problem can be easily solved by an exhaustive search.
When $K^*$ is not much greater than 1, an exhaustive search for $K^*$ is the only course to take.
On the other hand, when $K^* \gg 1$, it will be shown that $K^*$ can be found through a
conventional continuous variable optimization technique.


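Lemma 1 Given $\psi$ and $K$, $P_K(F = f)$, the probability of $f$ faults simultaneously occurring
to the system, is a monotonically decreasing function of $f$ when $\frac{K-f}{f+1}(e^{\psi/K} - 1) < 1$,
$1 \le f \le K$. The sufficient condition for $P_K(F = f)$ to be a monotonically decreasing
function of $f$ is $(e^{\psi/K} - 1) < \frac{2}{K-1}$, $K > 1$.


Proof: Since $P_K(F = f) > 0$, $P_K(F = f)$ is monotonically decreasing if
$\frac{P_K(F = f+1)}{P_K(F = f)} < 1$, $\forall f$. Using Eq. (4.1), we have
$\frac{P_K(F = f+1)}{P_K(F = f)} = \frac{K-f}{f+1}(e^{\psi/K} - 1)$, or $P_K(F = f)$ is monotonically
decreasing if $\frac{K-f}{f+1}(e^{\psi/K} - 1) < 1$. Note that $0 < (e^{\psi/K} - 1) < 1$ when
$\psi/K < 0.693$. Since $\frac{K-f}{f+1} > \frac{K-f-1}{f+2}$, the maximum value of $\frac{K-f}{f+1}$ over
$1 \le f \le K$ is $\frac{K-1}{2}$, $K > 1$; hence the sufficient condition for the ratio test to hold is
$(e^{\psi/K} - 1) < \frac{2}{K-1}$, $K > 1$. ■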
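
The ratio test of Lemma 1 is easy to check numerically. The sketch below assumes the ratio $\mu_f = \frac{K-f}{f+1}(e^{\psi/K} - 1)$ and reads the sufficient condition as $(e^{\psi/K} - 1) < 2/(K-1)$, i.e., the bound obtained at $f = 1$; that reading, and the function names, are assumptions rather than something the scanned text states legibly.

```python
import math

def mu(f, K, psi):
    """Ratio P_K(F = f+1) / P_K(F = f) = (K - f)/(f + 1) * (exp(psi/K) - 1)."""
    return (K - f) / (f + 1) * math.expm1(psi / K)

def sufficient_condition(K, psi):
    """Assumed sufficient condition: exp(psi/K) - 1 < 2/(K - 1), K > 1."""
    return K > 1 and math.expm1(psi / K) < 2.0 / (K - 1)

def is_decreasing_from_one(K, psi):
    """True if every ratio mu_f for 1 <= f <= K is below 1."""
    return all(mu(f, K, psi) < 1.0 for f in range(1, K + 1))
```

For $\psi = 1.8$ and $K = 6320$ the condition holds, whereas for $\psi = 3$ with the same $K$ it fails, which matches the rule of thumb that the condition is roughly $\psi < 2$ when $K \gg 1$.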

Lemma 2 When the sufficient condition of Lemma 1 holds,
$P\!\left(t_{ra} > K + f\frac{W}{K}\right) < P_K(F = f)\,\frac{1 - \mu_f^{K-f+1}}{1 - \mu_f}$, where
$\mu_f = \frac{K-f}{f+1}(e^{\psi/K} - 1)$.

Proof: When the sufficient condition of Lemma 1 holds, $P_K(F = f+1)/P_K(F = f) = \mu_f < 1$.
Since $\mu_f > \mu_{f+1}$, $\forall f$, we have
$\sum_{i=f}^{K} P_K(F = i) < P_K(F = f)\,(1 + \mu_f + \mu_f^2 + \cdots + \mu_f^{K-f})$,
or $P\!\left(t_{ra} > K + f\frac{W}{K}\right) < P_K(F = f)\,\frac{1 - \mu_f^{K-f+1}}{1 - \mu_f}$. ■
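
The geometric bound of Lemma 2 can be compared with the exact tail probability. The sketch below assumes the binomial form of $P_K(F = f)$ and the finite geometric sum $(1 - \mu_f^{K-f+1})/(1 - \mu_f)$ as the bounding factor; the exponent is a reconstruction, and the helper names are illustrative.

```python
import math

def pmf(f, K, psi):
    """P_K(F = f) in binomial form with R_p = exp(-psi/K)."""
    log_comb = (math.lgamma(K + 1) - math.lgamma(f + 1)
                - math.lgamma(K - f + 1))
    log_p = math.log(-math.expm1(-psi / K))      # log(1 - R_p)
    return math.exp(log_comb + f * log_p - (K - f) * psi / K)

def mu(f, K, psi):
    """Ratio of consecutive probabilities, P_K(F = f+1) / P_K(F = f)."""
    return (K - f) / (f + 1) * math.expm1(psi / K)

def exact_tail(f, K, psi):
    """Sum of P_K(F = i) for i = f .. K."""
    return sum(pmf(i, K, psi) for i in range(f, K + 1))

def lemma2_bound(f, K, psi):
    """P_K(F = f) times the finite geometric sum in mu_f (assumed form)."""
    m = mu(f, K, psi)
    return pmf(f, K, psi) * (1.0 - m ** (K - f + 1)) / (1.0 - m)
```

The bound is loose for small $f$, where $\mu_f$ is close to 1, and tightens quickly as $f$ grows, which is the region that matters when searching for the smallest $f$ whose tail falls below $\epsilon$.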

Note that $\psi \ll K$ holds for most realistic parameter values. When Lemma 1 holds, and
$K$ and $\epsilon$ are given, $f_K = \inf f_i$, such that $P\!\left(t_{ra} > K + f_i\frac{W}{K}\right) < \epsilon$, $\forall i$, can be determined
by applying Lemma 2 repeatedly. The next lemma states a key condition that can greatly
simplify the optimization algorithm.

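Lemma 3 If $K_1 (K_2) \gg 1$, $K_1 (K_2) \gg \psi$ and $K_1 (K_2) \gg f$, then
$P_{K_1}(F = f) \approx P_{K_2}(F = f)$, where $P_{K_i}(F = f)$ is the probability of $F = f$ when the
number of recovery pages is $K_i$.

Proof: $P_K(F = f) = \binom{K}{f} e^{-\psi(K-f)/K}\bigl(1 - e^{-\psi/K}\bigr)^{f}$. When $K \gg f$,
$\binom{K}{f} \approx \frac{K^f}{f!}$ and $e^{-\psi(K-f)/K} \approx e^{-\psi}$. Furthermore, when $K \gg \psi$, we get
$1 - e^{-\psi/K} \approx 1 - (1 - \frac{\psi}{K}) = \frac{\psi}{K}$. Combining the above expressions leads to
$P_K(F = f) \approx \frac{K^f}{f!}\, e^{-\psi} \left(\frac{\psi}{K}\right)^{f}$, or $P_K(F = f) \approx \frac{\psi^f e^{-\psi}}{f!}$. That is,
$P_K(F = f)$ is predominantly determined by $f$, and is insensitive to $K$. Thus,
$P_{K_1}(F = f) \approx P_{K_2}(F = f)$ holds. ■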
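
Lemma 3 says, in effect, that for large $K$ the page-fault distribution is close to a Poisson distribution with mean $\psi$. A minimal numerical check, assuming the binomial form of $P_K(F = f)$ with $R_p = e^{-\psi/K}$ and using $\psi = 1.8$ with $K = 500$ and $K = 13500$ (function names are illustrative):

```python
import math

def binomial_pmf(f, K, psi):
    """P_K(F = f) with per-page fault probability 1 - exp(-psi/K)."""
    log_comb = (math.lgamma(K + 1) - math.lgamma(f + 1)
                - math.lgamma(K - f + 1))
    log_p = math.log(-math.expm1(-psi / K))
    return math.exp(log_comb + f * log_p - (K - f) * psi / K)

def poisson_pmf(f, psi):
    """Limiting form psi^f * exp(-psi) / f!, independent of K."""
    return math.exp(-psi + f * math.log(psi) - math.lgamma(f + 1))

psi = 1.8
diff_f1 = abs(binomial_pmf(1, 500, psi) - binomial_pmf(1, 13500, psi))
diff_f3 = abs(binomial_pmf(3, 500, psi) - binomial_pmf(3, 13500, psi))
```

Both differences come out well inside the bounds of 0.05 and 0.001 quoted for these two values of $K$, and the $K = 13500$ curve is practically indistinguishable from the Poisson limit.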

Lemma 3 is valid for a broad range of $K$ values, and when $K_1, K_2 \gg 1$, the
$P_{K_i}(F = f)$'s are very close to each other. When Lemma 1 holds,
$P_{K_1}(F = f_1) < P_{K_2}(F = f_2)$, where $f_1 > f_2$. $P_K(F = f)$'s with different $K$ values
are plotted in Fig. 7. In these examples, the system has $W = 4$ M words of memory, $q = 4$
channels, $\lambda = 0.75$ byte/$10^9$ hours, and $t = 150$ hours. Thus, $\psi = \lambda t q W = 1.8$,
$|P_{500}(F = 1) - P_{13500}(F = 1)| < 0.05$, and $|P_{500}(F = 3) - P_{13500}(F = 3)| < 0.001$.
Denoting the optimal value of $K$ by $K^*$, the most desirable property of Lemma 3 is that
when $K^* \gg 1$, we get $f_K \approx f_{K^*}$, and thus, $K^*$ can be found by the following theorem.

Theorem 1 When $K^* \gg 1$, $K^* \approx \sqrt{f_K W}$, where $K$ is an arbitrary integer, $1 \ll K \le W$.


Proof: From Lemma 3, we get $P_{K_1}(F = f) \approx P_{K_2}(F = f)$, $\forall f$. Thus, when $K^* \gg 1$,
we have $f_K \approx f_{K^*}$, or $f_{K^*}$ can be found by applying Lemma 2 to an arbitrary $K$ such
that $P(F > f_K) < \epsilon$. Clearly, for a given $\epsilon$, $f_K \approx f$, $\forall K \gg 1$, where $f$ is some constant.
The cost function $Z(t)$ to be minimized can be expressed as $\min\,(K + f\frac{W}{K})$. Since the
objective function is convex when $K$ is continuous, the optimal solution of real-valued $K$'s
is $K' = \sqrt{f_K W}$. Then, $K^*$ can be found by an exhaustive search in $[K' - \delta, K' + \delta]$, where
$\delta$ is some constant yet to be found. ■
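
For completeness, the continuous minimization behind $K' = \sqrt{f_K W}$ can be written out as below. It treats $f_K$ as a constant independent of $K$, which is what Lemma 3 justifies for $K \gg 1$; the notation $Z(K)$ for the cost as a function of $K$ is introduced here only for this derivation.

```latex
\begin{aligned}
Z(K) &= K + \frac{f_K W}{K}, \qquad K > 0, \\
\frac{dZ}{dK} &= 1 - \frac{f_K W}{K^2} = 0
   \;\Longrightarrow\; K' = \sqrt{f_K W}, \\
\frac{d^2 Z}{dK^2} &= \frac{2 f_K W}{K^3} > 0
   \quad\text{(so $Z$ is convex and $K'$ is the minimizer)}, \\
Z(K') &= 2\sqrt{f_K W}.
\end{aligned}
```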

An example cost function $Z(t)$ is plotted in Fig. 8. The curve shown in Fig. 8 is
$K + f_K \lfloor W/K \rfloor$. It can be seen that the integral constraint on $K$ and $\lfloor W/K \rfloor$ causes the
sawtooth curve in $[K - \Delta K, K + \Delta K]$, but has only a small impact on the global curve shape. In
this example, $\epsilon = 10^{-5}$, $\psi = 1.8$, and thus $f_K = 10$. Thus, $K' = \sqrt{10 \times 4 \times 10^6} = 6324.5$.
Through an exhaustive search, it is found that there are multiple optimal solutions, and
the one closest to $K'$ is 6320. The discrepancy between the result obtained from Theorem 1
and the exact solution is due to the integral constraints on $K$ and $\lfloor W/K \rfloor$. Thus, having found
$K'$, the optimal solution can be easily found by $K^* = \min K$, $\lfloor W/K \rfloor = \lfloor W/K' \rfloor$. However, from
a practical viewpoint, the difference between $K'$ and $K^*$ is less than 0.1 percent, and thus,
it is reasonable to use $\lfloor K' \rfloor$ as an optimal solution.
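
The example can be reproduced with a short search. The sketch below makes two assumptions that should be kept in mind: $f_K$ is taken from the Poisson approximation of Lemma 3 as the smallest $f$ with $P(F > f) < \epsilon$, and the cost is taken as $K + f_K \lfloor W/K \rfloor$ in units of $t_v$. The window half-width `delta` and all function names are illustrative.

```python
import math

def poisson_tail(f, psi):
    """P(F > f) under the Poisson approximation of Lemma 3."""
    cdf = sum(math.exp(-psi) * psi**i / math.factorial(i) for i in range(f + 1))
    return 1.0 - cdf

def smallest_f(psi, eps):
    """Smallest f such that P(F > f) < eps."""
    f = 0
    while poisson_tail(f, psi) >= eps:
        f += 1
    return f

def cost(K, f_K, W):
    """Re-alignment cost in vote times: K page votes plus f_K pages of floor(W/K) words."""
    return K + f_K * (W // K)

def search(W, f_K, delta):
    """Exhaustive search for the integer optimum in [K' - delta, K' + delta]."""
    k_real = math.sqrt(f_K * W)
    lo, hi = max(1, int(k_real) - delta), min(W, int(k_real) + delta)
    costs = {K: cost(K, f_K, W) for K in range(lo, hi + 1)}
    best = min(costs.values())
    optima = [K for K, c in costs.items() if c == best]
    closest = min(optima, key=lambda K: abs(K - k_real))
    return k_real, best, optima, closest

W = 4_000_000
f_K = smallest_f(1.8, 1e-5)
k_real, best, optima, closest = search(W, f_K, delta=200)
```

Under these assumptions the search returns several values of $K$ with the same minimum cost, and the one nearest to $K'$ agrees with the value reported in the text.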

From the above example, we can see that even when CEMs have fault detection capability
only, the performance of Scheme_1 is nearly a thousand times better than voting on
every word. The performance will be further improved if CEMs also have fault recovery
capabilities, which is usually the case. Using the example shown in Fig. 8, $W = 4 \times 10^6$ and
$1 - C_c \approx 2 \times 10^{-7}$ for SEC/DED codes, we get $f_K = 1$ and $K^* \approx 2000$, when $\epsilon$ remains
the same ($10^{-5}$).

Cost functions for systems with and without SEC/DED codes are plotted in Fig. 9.
When the memory access time is 500 nanoseconds, it takes 2 seconds for a channel to scan
main memory. The total re-alignment time for systems without CEMs is 11 seconds. On
the other hand, when a fault occurs in a QMR-CEM system, with a probability greater
than $1 - 10^{-5}$, it will take less than 2.045 seconds to complete memory re-alignment.

5 Conclusion 

The reliability of redundant computing systems used for ALS is analyzed and some 
design issues are discussed. The concept of access set is used for the analysis of multiple 
channel faults leading to system crash. When fault arrivals are independent and the system 
is free of error propagation and latent faults, the probability of system crash due to multiple 
channel faults is dictated primarily by component failure rates. It is shown that with the 
state-of-the-art technology, the probability of system crash due to multiple channel faults 
is insignificant even when the system size is fairly large. 

The case study of ALS has shown that the chief cause of unreliability in large redundant 
systems is the depletion of hardware resources (as a result of component failures), especially 
when the system has a long mission time. It is worth mentioning that our evaluation of 
the effectiveness of CEMs in the ALS is very conservative, because all transient faults 
are assumed to be recoverable by either NMR-CEM or conventional NMR systems. Since
transient faults are typically 10 times more frequent than permanent faults [8, 9], the 
reliability improvement by using CEMs would be even greater when conventional systems 
do not have perfect recovery capability for transient faults. 

Although emerging new technologies continue to improve hardware reliability and per- 
formance, they also stimulate new applications which require higher reliability and com- 
puting power. Thus, as main memory is the most vulnerable system component for the 
current technology, it is expected to be the reliability bottleneck in future computing
systems. Fortunately, the design of CEMs for main memory is very simple, and very high
fault coverage can be achieved with low overhead. For the example discussed in this report, 
about 22% of the memory overhead was induced for each channel to embed SEC/DED codes 
into its main memory. By contrast, adding channels or increasing redundancy will increase 
overheads substantially more in the power, physical size and channel synchronization of the 
system. Thus, embedding SEC/DED codes into main memory is a much more cost-effective 
method to prolong the resource depletion time than adding more channels to the system. 

Large main memory coupled with slack voters makes memory re-alignment very time- 
consuming. Thus, memory re-alignment in a large system should be avoided whenever 
possible. It is shown in this report that CEMs can dramatically reduce the need of memory 
re-alignment, and can speed up the re-alignment process substantially. 

Another serious threat to memory re-alignment is the propagation of errors. If error 
propagation is not effectively prevented, the number of contaminated pages will increase 
quickly, and thus, the number of pages needing to be re-aligned will increase quickly. Error 
propagation can be prevented only when the system has very good error detection capability. 
This is a matter of our future research. 

Appendix A: List of Symbols


$AS^i$, $u$: $AS^i$ is the $i$-th access set in the system. $AS^i$ is essentially a set of memory
words that will be accessed continuously by the CPU (active agent) for a period of
time. $u$ is the size of an access set.

$K$, $f_K$: $K$ is the number of recovery pages in the system. $f_K$ is an upper bound for
$f$, the number of faulty recovery pages, such that $P\!\left(t_{ra} > K + f\frac{W}{K}\right) < \epsilon$.

$M_i$: the length of time that the active agent stays in $AS^k$ during its $i$-th
visit to $AS^k$.

$m$: the number of access sets in the system.

$N^i(t)$, $N(t)$: $N(t) = \sum_{i=1}^{m} N^i(t)$, where $N^i(t)$ is the number of the agent's visits to
$AS^i$ by time $t$, and $N(t)$ is the total number of visits to access sets by the active
agent during $[0, t)$.

NMR-CEM, QMR: NMR-CEM is an N modular redundant system with CEMs embedded into
each channel. QMR is a quadruplex modular redundant system.

$P_c(Y_i)$, $P_N(t)$: $P_c(Y_i)$ is the probability of a channel becoming faulty during time
interval $Y_i$. $P_N(t)$ is the probability of system crash caused by multiple channel faults
during time interval $[0, t)$.

$P_K(F = f)$: the probability of $f$ recovery pages becoming faulty when the
number of recovery pages is $K$; $P_K\!\left(t_{ra} = K + f\frac{W}{K}\right)$ is the probability of
the re-alignment time being $K + f\frac{W}{K}$.

$T_i$, $Y_i$: $T_i$ is the time the $i$-th vote is held; $Y_i$ is the interval between $T_{i-1}$ and $T_i$.

$V_k^i$, $S_k^i$: $V_k^i$ is the event that the active agent makes the $k$-th visit to $AS^i$. $S_k^i$ is the
moment the event $V_k^i$ begins.

$X_j^i$: a random variable denoting the number of fault occurrences to $AS^i$
during the agent's $j$-th visit to access sets.

$\lambda_a$, $\lambda_p$, $\lambda_i$, $\lambda_m$, $\lambda_o$: the failure rates of an access set, a processor (including
control logics), interstage, RAM memory, and ROM memory of each channel
in the system, respectively.

$\mu_f$: the ratio test $P_K(F = f+1)/P_K(F = f)$, $\mu_f = \frac{K-f}{f+1}(e^{\psi/K} - 1)$, where $\psi$ is the
product of memory size (words), failure rate of a memory word, number of redundant
channels, and the time $t$.

$\Pi_E$, $\Pi_Q$: the environmental and quality factors of a component, respectively.
Component failure rate is adjusted by $\lambda' = \Pi_E \times \Pi_Q \times \lambda$.

$\Omega_i$: the $i$-th recovery page.

Figure 1: The voting and communication network of computing channels in FTP.

Figure 2: The total probability of multiple channel faults with different access set sizes and
mission times.

Figure 5: The unreliability of different systems when SEC/DED codes are embedded into
ROM and RAM where (a) $\Pi = 0.1$, and (b) $\Pi = 1$.

Figure 6: The probability of FTP having fault masking capability before launching when
SEC/DED codes are embedded into RAM.

Figure 7: Probability distribution functions of memory re-alignment times when t = 200
hours, the system has 4 channels, each with 4 M words memory.

Figure 8: The cost function of a system with perfect detection capability, 4 channels, 4M
words, $t = 150$ hours, $\lambda = 0.75 \times 10^{-9}$/hour-word, and $\epsilon = 10^{-5}$. (a) The global plot, and
(b) a blow up of the cost function around the optimal point.






